Current integrated circuit technology provides a full 
spectrum of arithmetic devices, varying in speed and 
capabilities. These devices can be divided into three 
groups according to their speed. In the first, or slowest, 
group are calculator chips that operate at greater than 
1 ms; in the second, or medium speed group (1 µs and 
slower), n-channel metal-oxide semiconductor 
microprocessors; and in the third, or fastest group (1 µs or 
better), bipolar data slices and bipolar discrete units 
(adders, multipliers, etc). The third group of high 
speed arithmetic integrated circuits, under discussion 
here, are significant building blocks in constructing 
"number crunching" systems for use in weather model- 
ing, nuclear physics computations, and realtime digital 
signal processing tasks such as speech and image pro- 
cessing, computerized tomography, and air traffic mon- 
itors. 

Technology Background 

To understand the difficulties encountered in fabricating 
high speed arithmetic integrated circuits (ICs), it is 
necessary to review semiconductor technology in general, 
with particular attention to bipolar technology. In this 
respect, consider the three factors that limit how much 
parallel arithmetic can be put into one chip: maximum 
allowed power dissipation, pin count, and cost. 



Maximum Allowed Power Dissipation 

In high speed technologies (mostly bipolar), maximum 
number of gates is a direct function of the chip's max- 
imum allowed power dissipation. This, in turn, depends 
on maximum allowed junction temperature of the 
silicon die, ambient temperature, and ability of the IC 
package to dissipate heat (thermal resistance). This 
relationship is defined as 

$$\text{Max power dissipation} = \frac{T_{junction} - T_{ambient}}{\theta_{ja}}$$

For military specifications, T_ambient = 125 °C, a 
typical package in still air has a thermal resistance (θ_ja) 
of 40 °C/W, and the maximum allowed junction temperature 
is 175 °C [for transistor-transistor logic (TTL)]. 
These values give 1.25-W maximum power dissipation, 
which is typical of most large-scale integrated (LSI) 
devices on the market today. However, effective thermal 
resistance between the IC package and the ambient 
environment can be reduced to 15 °C/W by attaching the 
package to a heat sink and cooling it by forced air.[1] 
If IC operation is limited to the commercial temperature 
range (maximum 70 °C), maximum power dissipation 
is about 7 W. While this latter dissipation value seems 
feasible, available ICs with this characteristic are 
undesirable, because they will usually be interfaced with and 
surrounded by ICs that dissipate less than 1 W. Thus, a 
local hot spot is generated that is difficult to cool 
efficiently and reliably. 

The maximum number of gates that can be integrated 
into one chip is equal to the maximum allowed power 
dissipation divided by the power dissipation of each gate. 
For example, one of the most popular and successful 
technologies today is low power Schottky TTL (LS/TTL), 
which dissipates 2 mW/gate; this LSI chip type contains 
about 600 gates. 
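
The two relationships above (power limit from the thermal budget, then gate budget from the power limit) can be checked numerically. The following is a minimal sketch in Python; the function and variable names are illustrative assumptions, and the inputs are the figures quoted in the text.

```python
def max_power_w(t_junction_c, t_ambient_c, theta_ja_c_per_w):
    """Maximum power (W) allowed by the junction-to-ambient temperature budget."""
    return (t_junction_c - t_ambient_c) / theta_ja_c_per_w


def max_gates(max_power_watts, gate_power_mw):
    """Gate budget: chip power limit divided by the dissipation of one gate."""
    return int(round(max_power_watts * 1000.0 / gate_power_mw))


# Military ambient, typical package in still air (40 degC/W).
military = max_power_w(175, 125, 40)
# Commercial ambient, heat sink plus forced air (15 degC/W).
commercial = max_power_w(175, 70, 15)
# LS/TTL at 2 mW/gate under the 1.25-W limit.
ls_ttl_gates = max_gates(military, 2.0)

print(military, commercial, ls_ttl_gates)
```

With the military figures the budget works out to 625 gates, in line with the "about 600 gates" quoted for LS/TTL LSI chips.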

In a given technology, power dissipation of each gate 
is roughly proportional to its physical size, which in 
turn is determined by number of active elements and 
resolution of the lithography used to define the 
geometries of the transistors. For random-access memories 
(RAMs) the progress from 1k to 16k bits was mostly 
due to reducing the 3-transistor cell to a 1-transistor 
cell and cutting the line width of the fabrication pattern 
from 10 to 5 µm. Introduction of the 64k RAM will 
probably require an improvement over present 
photolithography techniques, with electron beam lithography 
the most likely process. 

Pin Count 

Number of available pins in the IC package is a severe 
limitation for high speed arithmetic on wide words. 
Multiplying 16-bit operands requires a package with at 
least 66 pins. The state of the art is 64 pins, and no 
major breakthrough is in sight.[2] Even custom ICs do 
not exceed this limit by much, although the Amdahl 
computer uses a custom 84-pin flat-pack package. Pin 
count limitations have led to the data "slice" concept; 
ie, partitioning the desired system into identical parts 
that have common control and interconnection 
mechanisms (carry, etc). The pin limitation can also be 
circumvented by time-multiplexing the information via 
common pins; however, this reduces the overall speed. 
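
One way to arrive at the 66-pin figure is sketched below in Python. The breakdown is an assumption, not something the text spells out: two n-bit operand ports, a 2n-bit product port, and two supply pins, with no control or clock pins counted.

```python
def multiplier_pins(n_bits, supply_pins=2):
    """Pins for a fully parallel n x n multiplier (assumed breakdown)."""
    operand_pins = 2 * n_bits      # multiplicand and multiplier inputs
    product_pins = 2 * n_bits      # double-length product output
    return operand_pins + product_pins + supply_pins


pins_16 = multiplier_pins(16)
print(pins_16)
```

Any control, clock, or rounding inputs would push the count higher, which is why the text says "at least".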

Cost 

If power dissipation is not a limiting factor, as with 
metal-oxide semiconductor (MOS) and integrated-injection 
logic (IIL) technologies, cost will limit the die size. 
Noyce [3] points out that if yield is a function of random 
defects, cost increases exponentially with die size. For 
example, if a given chip size yields 10% good die, a chip 
twice as large will yield 1%. The cost for twice the 
function will be 20 times as great, since each wafer holds 
half as many die and each die is one-tenth as likely to be 
good. Other major cost elements 
are testing and assembly, which are approximately fixed 
per chip. Consequently, as the number (N) of functions 
per chip increases, the assembly cost per function decreases 
proportionally to 1/N. Minimum cost per function will be at the 
crossover point between silicon chip cost and assembly 
and test cost. Progress in ICs can be viewed as a fixed 
cost (about $10) for increased complexity (three orders 
of magnitude in the last 15 years).[2] 
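
The 10%, 1%, and 20-times figures follow from an exponential yield model. The sketch below assumes a simple Poisson model, yield = exp(-D x A), which is one common reading of "yield is a function of random defects"; the text does not name a specific model, and wafer-edge losses are ignored. All names are illustrative.

```python
import math


def poisson_yield(defect_density, die_area):
    """Fraction of good die under an assumed Poisson random-defect model."""
    return math.exp(-defect_density * die_area)


def cost_per_good_die(die_area, defect_density, wafer_area=100.0):
    """Cost of one good die in units of wafer cost (edge losses ignored)."""
    dies_per_wafer = wafer_area / die_area
    return 1.0 / (dies_per_wafer * poisson_yield(defect_density, die_area))


# Choose the defect density so that a die of area 1.0 yields 10%.
density = math.log(10.0)
small_yield = poisson_yield(density, 1.0)
large_yield = poisson_yield(density, 2.0)
cost_ratio = cost_per_good_die(2.0, density) / cost_per_good_die(1.0, density)

print(small_yield, large_yield, cost_ratio)
```

Doubling the area squares the yield (0.1 becomes 0.01) and halves the die count, which together give the factor of 20.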

Characterizing Technologies 

To simplify comparisons among the various technologies, 
the "natural" gate implementation is analyzed. The natural 
gate is a realization of a Boolean operator that requires 
a minimum number of transistors while giving maximum 
speed. For TTL, this gate is a NAND; for ECL, it is a NOR. 
Unfortunately, no known technology has the exclusive-OR 
as its natural gate. Unless otherwise stated, the word 
gate will be used instead of natural gate. 

Two main characteristics of a gate are power and 
speed. A common figure of merit is the speed-power 
product, where speed is in nanoseconds (ns), power is 
in milliwatts (mW), and product is in picojoules (pJ). 
A third characteristic is the gate's fan-in. In most 
bipolar technologies, fan-in is four. In the TTL family, 
NAND gates with one, two, three, or four inputs have 
the same speed and the same power, whereas larger 
fan-in NAND gates have larger speed-power products. 
Limited fan-in is a severe limitation in arithmetic 
operations on wide words requiring carry-lookahead (CLA). 

Bipolar Technologies 

All fast IC technologies are bipolar; the two most 
commonly used today for high speed arithmetic 
hardware are ECL and Schottky TTL. Integrated-injection 
logic (IIL) is a relatively new technology that will 
become important for very large-scale integration (above 
1000 gates/chip). 

Emitter-Coupled Logic 

ECL is the fastest commercially available technology; 
gate delay is 1 to 2 ns, and speed-power product is 50 
pJ. Although introduced about 10 years ago, the 
technology is still limited, because of design difficulties, 
to applications that require very high speed, such as top 
of the line mainframe computers (IBM 370/168, Amdahl 470, 
DECsystem 10) and some signal processing equipment. 
Interconnections become, in effect, transmission 
lines that require proper termination and 
matching. Also, reliable ECL designs need to use 
multilayer printed circuit (PC) boards, which are expensive. 
Since ECL devices normally are powered from a -5.2-V 
supply, they are incompatible with the popular TTL 
family. Another severe shortcoming of the ECL devices 
is their excessive power dissipation of 25 to 60 mW/gate, 
which requires forced air cooling. Nevertheless, 
if maximum speed is needed, ECL is the best choice. 
The internal circuit of ECL is a current-switching 
mechanism, which implies a constant current drain on the 
power supply. By contrast, the TTL logic family, using 
voltage thresholds, causes large current spikes on the 
power supply during switching from one state to another. 

ECL technology is implemented by various families. 
The fastest (1 ns) is MECL III from Motorola,[4] Fairchild, 
and Signetics. The 10,000 family has the largest 
selection of ECL LSI devices, and it is the only ECL family 
with arithmetic units. A typical gate operates at 2 ns and 
25 mW. 

Transistor-Transistor Logic 

TTL technology was introduced in 1964 by Texas 
Instruments (TI).[5] It has been the most popular logic family 
for more than a decade. The original family had a 10-ns 
gate at 10-mW dissipation, and devices of this family are 
still the least expensive ICs. However, two subfamilies are 
making inroads. One is the Schottky TTL (S/TTL) with a 
3-ns gate at 20 mW, matching the speed-power product 
of ECL. The second subfamily, low power Schottky 
(LS/TTL), retains the speed of original TTL but decreases the 
power dissipation to 2 mW. Popularity of TTL technology 
has led to the largest selection of different ICs. Most 
small-scale integrated (SSI) and medium-scale integrated (MSI) 
devices are triplicated in three subfamilies (TTL, S/TTL, 
LS/TTL); however, most LSI devices are implemented only 
in LS/TTL. TTL ICs operate from a 5-V power supply. PC 
board layout involves no critical problems of the kind 
found with ECL, and power dissipation typically does not 
require any special cooling. As noted previously, precautions need 
to be taken in decoupling the power supply lines, due to 
current spikes that are present while switching from one 
state to another. 

TABLE 1 

Comparison of Common Bipolar Technologies 
Used in Implementing High Speed Arithmetic Devices 

                            Gate Characteristics                        2-Input Exclusive-OR
Technology/        Natural                    Speed-Power  Density                        Speed-Power
Year Introduced    Gate     Delay    Power    Product      (gates/mm2)  Delay    Power    Product      Comments
ECL-III (1968)              1.1 ns   60 mW    66 pJ        30           1.3 ns   70 mW    91 pJ        Limited number of functions
ECL-10,000 (1971)  NOR      2 ns     25 mW    50 pJ        30           2.5 ns   50 mW    125 pJ       Large selection of functions
S/TTL (1970)       NAND     3 ns     20 mW    60 pJ        30           7 ns     60 mW    420 pJ       Large selection of functions
LS/TTL (1972)      NAND     10 ns    2 mW     20 pJ        30           10 ns    8 mW     80 pJ        Large selection of functions
IIL (1975)         NAND     10 ns    0.1 mW   1 pJ         300                                         Not a mature technology
NMOS (1973)                 100 ns   0.1 mW   10 pJ        130                                         For reference only
EEIC (1977)                 0.25 ns  2 mW     0.5 pJ       100                                         Still in research and development

Integrated-Injection Logic 

IIL is a relatively new technology (1975) that has not 
matured like ECL and TTL; thus, conflicting reports exist 
about its potential characteristics. Nevertheless, a gate 
with LS/TTL speed of 10 ns and power dissipation of 
0.1 mW is reported.[7] It is likely that by 1980, most high 
density monolithic arithmetic processors will be 
implemented in IIL, replacing LS/TTL completely for such 
applications. In fact, the photochemical process used in 
fabricating LS/TTL can be modified to handle IIL, making 
it even more attractive than a completely new technology. 

Most digital IIL devices are powered from a 5-V power 
supply to retain TTL compatibility. However, IIL 
technology needs only 1 V to operate, since it requires 
only current sourcing, and no voltage thresholds are 
used. Thus, for applications where TTL compatibility 
is not required, further power reduction is possible 
at no sacrifice in speed. 

Table 1 summarizes the characteristics of the 
described technologies. A characterization of the 
implementation of an exclusive-OR gate is included for each 
technology because this Boolean operator is the major 
element in implementing digital arithmetic. 
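
The speed-power products quoted for each technology can be recomputed from the gate delay and power, since 1 ns x 1 mW = 1 pJ. The following Python sketch uses the delay and power figures given in this article; the dictionary layout and names are illustrative assumptions.

```python
# (gate delay in ns, gate power in mW), as quoted in the article
GATES = {
    "ECL-III": (1.1, 60.0),
    "ECL-10,000": (2.0, 25.0),
    "S/TTL": (3.0, 20.0),
    "LS/TTL": (10.0, 2.0),
    "IIL": (10.0, 0.1),
    "EEIC": (0.25, 2.0),
}


def speed_power_pj(delay_ns, power_mw):
    """Speed-power product in picojoules (ns x mW = pJ)."""
    return delay_ns * power_mw


products = {name: speed_power_pj(*dp) for name, dp in GATES.items()}

# List the technologies from lowest to highest energy per switching event.
for name in sorted(products, key=products.get):
    print(f"{name:12s} {products[name]:6.1f} pJ")
```

Sorting by the product shows why IIL and EEIC are attractive for dense chips even though ECL-III has a shorter delay than IIL.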

Arithmetic Elements 

Arithmetic Logic Unit 

Arithmetic logic units (ALUs)[4,8] are capable of add, 
subtract, shift, and logic operations (AND, OR, exclusive-OR). The 
most popular ALU device is the 74S181, implemented in 
S/TTL technology (or the 74181), which is used in 
minicomputers such as the DEC PDP-11 and the Data General NOVA. 
This device performs addition using a carry-lookahead 
algorithm across four bits at a time. When operating 
on wider words, a companion device (74S182) provides 
a full carry-lookahead across any number of bits. Each 
carry-lookahead unit (74S182) receives the generate 
and propagate terms from a group of four 74S181s. 
In general, the number of levels of carry-lookahead for 
n bits is log4 n; eg, adding 64 bits with full carry-lookahead 
takes 15 ns in ECL and 28 ns in S/TTL. 
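
The generate/propagate scheme described above can be modeled bit by bit. The sketch below, in Python, is a behavioral model only: it is not the gate-level logic of the 74S181 or 74S182, and all function names are assumptions. It adds 16 bits with four 4-bit groups and one second-level lookahead unit, and counts lookahead levels as the text does.

```python
def alu_group(a4, b4):
    """Per-bit generate/propagate lists and group (G, P) for a 4-bit slice."""
    g = [(a4 >> i) & (b4 >> i) & 1 for i in range(4)]
    p = [((a4 >> i) | (b4 >> i)) & 1 for i in range(4)]
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[3] & p[2] & p[1] & p[0]
    return g, p, group_g, group_p


def lookahead_carries(gen, prop, c_in):
    """Carries into four positions, plus carry out, with no rippling."""
    c1 = gen[0] | (prop[0] & c_in)
    c2 = gen[1] | (prop[1] & gen[0]) | (prop[1] & prop[0] & c_in)
    c3 = (gen[2] | (prop[2] & gen[1]) | (prop[2] & prop[1] & gen[0])
          | (prop[2] & prop[1] & prop[0] & c_in))
    c4 = (gen[3] | (prop[3] & gen[2]) | (prop[3] & prop[2] & gen[1])
          | (prop[3] & prop[2] & prop[1] & gen[0])
          | (prop[3] & prop[2] & prop[1] & prop[0] & c_in))
    return [c_in, c1, c2, c3], c4


def add16(a, b, c_in=0):
    """16-bit add: four 4-bit groups plus one second-level lookahead unit."""
    groups = [alu_group((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF)
              for k in range(4)]
    group_c, c_out = lookahead_carries([grp[2] for grp in groups],
                                       [grp[3] for grp in groups], c_in)
    total = 0
    for k, (g, p, _, _) in enumerate(groups):
        bit_c, _ = lookahead_carries(g, p, group_c[k])
        for i in range(4):
            pos = 4 * k + i
            total |= (((a >> pos) ^ (b >> pos) ^ bit_c[i]) & 1) << pos
    return total, c_out


def cla_levels(n_bits, group=4):
    """Number of lookahead levels needed to span n_bits (log base 4)."""
    levels, span = 0, 1
    while span < n_bits:
        span *= group
        levels += 1
    return levels


print(add16(0x1234, 0xEDCB, 1), cla_levels(64))
```

The same lookahead function serves both levels: inside a slice it takes bit-level terms, and at the second level it takes the group terms from four slices.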

The 74S181 (and the 10181) are combinatorial de- 
vices, and accumulation of results requires an additional 
register (accumulator). The 74S281 is an ALU with an 
accumulator on one chip, which still uses the 74S182 
for carry-lookahead. Adding and storing 64-bit operands 
take 42 ns. 

Texas Instruments has introduced two additional 
-81 devices. The 74S381 is similar to the 74S181 — some 
functions of the 74S181 were removed to enable it to 
be packaged in a 20-pin, 0.3" (7.6-mm) package instead 
of the 24-pin, 0.6" (15.2-mm) package used to house 
the 74S181. The second device, the 74S481, will be 
discussed in the section dealing with processor elements. 



Table 2 lists the speed and power of commonly used 
ALUs. As can be expected, power is approximately 
proportional to the gate count of the device. The first three 
devices listed in the table are basic ALU elements. The 74S181 
is implemented in S/TTL; the 10181 in ECL. The 74S182 
and 10179 serve as support chips for the ALU, and 
provide carry-lookahead when several ALUs are cascaded. 
The 74S281 contains an on-chip accumulator, but is 
slower than the combinatorial ALUs. Table 3 extends 
the chip comparison into the system level. When 
several ALUs are cascaded, the carry-lookahead units 
provide faster addition and subtraction, at the expense 
of increased power dissipation. S/TTL addition of 64 
bits takes only 28 ns, but results in 11-W power 
dissipation. Addition of only 4 bits does not require 
carry-lookahead, addition of 16 bits requires only 1 
carry-lookahead unit, but addition of 64 bits uses 5 
carry-lookahead units. 
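
The unit counts and the 11-W figure can be reproduced with a short calculation. This Python sketch assumes each lookahead unit serves up to four groups (as the 74S182 does) and uses the per-chip power figures from Table 2 (600 mW per 74S181, 350 mW per 74S182); the function names are illustrative.

```python
def cla_unit_count(n_bits, group=4):
    """Return (number of 4-bit ALUs, number of lookahead units) for n_bits."""
    slices = n_bits // group
    units = 0
    blocks = slices
    while blocks > 1:
        blocks = -(-blocks // group)   # ceiling division: units at this level
        units += blocks
    return slices, units


def adder_power_mw(n_bits, alu_mw=600, cla_mw=350):
    """Total dissipation of the ALUs plus their lookahead units, in mW."""
    slices, units = cla_unit_count(n_bits)
    return slices * alu_mw + units * cla_mw


for width in (4, 16, 64):
    print(width, cla_unit_count(width), adder_power_mw(width))
```

For 64 bits this gives 16 ALUs and 5 lookahead units at about 11.4 W, matching the figures quoted in the text.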

Multipliers 

In describing the ALU adder ICs it is clear that the 
most common addition algorithm is the carry-lookahead, 
and the most common configuration is four bits per 
IC. In contrast, multipliers use a variety of algorithms 
and configurations. 

Parallel multiplication algorithms can be divided into 
two types: those that generate partial products and 
those that add the partial products. In generating 
partial products, the most direct 
method is to use AND gates. If the multiplier (Y) and 
the multiplicand (X) each have n bits, there are n 
partial products. The n bits of the ith partial product 
are generated by ANDing Yi with each of the n bits of 
the multiplicand X. However, in 2's complement 
representation, a correction is required since the most 
significant bit (MSB) has, effectively, a negative weight. 
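
The AND-gate method described above can be sketched for unsigned operands as follows. This is an illustrative Python model with assumed names; the 2's complement correction mentioned in the text is not modeled.

```python
def partial_products(x, y, n):
    """Unsigned n x n partial products: row i is (X AND Yi), weighted by 2**i."""
    mask = (1 << n) - 1
    x &= mask
    rows = []
    for i in range(n):
        y_i = (y >> i) & 1
        row = x if y_i else 0       # ANDing Yi with every bit of X
        rows.append(row << i)       # the ith row carries weight 2**i
    return rows


rows = partial_products(13, 11, 4)
print(rows, sum(rows))
```

Summing the n rows gives the product, so the remaining work in a parallel multiplier is the addition of these rows.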

Booth's algorithm [9] is a method of recoding the 
multiplier so that the sign bit (MSB) is treated in the same 
way as the rest of the bits. A modified Booth's algorithm, 
suggested by MacSorley,[10] serves as a means of halving 
the number of partial products while keeping the elegance 
(sign bit treated as any bit) of the original algorithm 
for 2's complement numbers. The reduced number of 
partial products increases multiplication speed and 
decreases gate count. The motivation behind Booth's 
algorithm is to skip over a string of 1s and 0s, rather than 
form a partial product for each bit. Skipping a string 
of 0s is clear. Skipping over a string of 1s involves 
computing a string of 1s by subtracting the weight of 
the rightmost 1 from the modulus of the string. For 
example, the binary string 1111 is 2^4 - 2^0 = 15, and the 
binary string 11100 is 2^5 - 2^2 = 28. 

TABLE 2 

Comparison of Arithmetic Logic Units 

Part No.   Function                    Gate Count   Speed    Power
74S181     4-bit ALU                   75           11 ns    600 mW
10181      4-bit ALU                   75           7 ns     600 mW
74S281     4-bit ALU/accumulator       100          22 ns    700 mW
74S182     4 groups carry-lookahead    20           7 ns     350 mW
10179      4 groups carry-lookahead    20           4 ns     300 mW

TABLE 3 

Comparison of Speed/Power 

                                Speed/Power
Part No.         Technology     4 Bits          16 Bits    64 Bits
74S181/74S182    S/TTL          11 ns/600 mW    ...        28 ns/11 W
10181/10179      ECL            7 ns/600 mW     ...        15 ns/...

In the actual hardware implementation, Booth's 
algorithm requires that the operand (multiplier) be 
divided into N/2 groups or substrings, each of which has 
three bits. Assume that the multiplier has the binary 
pattern 0111. A 0 digit is added to the right, and the 
resultant number (01110) is divided into two 3-bit 
groups. Each possible permutation of a substring 
is looked up in the chart to determine the partial 
products. 



2^1      2^0    2^-1
Y(i+1)   Y(i)   Y(i-1)   Operation
0        0      0        Add zero (no strings)
0        0      1        Add multiplicand (end of string)
0        1      0        Add multiplicand (a string)
0        1      1        Add twice the multiplicand (end of string)
1        0      0        Subtract twice the multiplicand (beginning of string)
1        0      1        Subtract the multiplicand (-2X + X)
1        1      0        Subtract the multiplicand (beginning of string)
1        1      1        Subtract zero (center of string)



The first group is 110, which requires subtracting the 
multiplicand; the second group (011) requires adding 
twice the multiplicand. Since the second group is shifted 
twice, its relative weight is four times that of the first 
group. Thus, "adding twice" for the second group means 
"adding eight times." Combining the two groups, the 
result is "add seven times the multiplicand." This method 
requires only six easy operations: ±0, ±X, ±2X. 

Every two contiguous groups have one bit in common, 
as follows. 

Y7  Y6  Y5  Y4  Y3  Y2  Y1  Y0  Y-1

Group A: Y1 Y0 Y-1
Group B: Y3 Y2 Y1
Group C: Y5 Y4 Y3
Group D: Y7 Y6 Y5

This padded 8-bit multiplier is divided into four groups, 
each made up of three bits. Each of the groups is 
operated upon by encoding the previous definitions. 
However, each group has a different weight. Group A has a 
weight of 1, group B a weight of 4, and so on. Note 
that bit -1 is always 0. 

Thus the modified Booth's algorithm is a multiplier 
encoding scheme that involves a constant shift of two 
bits at a time while examining three multiplier bits, 
resulting in N/2 partial products rather than the N 
partials involved without encoding. This algorithm can 
be extended by shifting three bits at a time while ex- 
amining four bits at each subgroup. However, in encod- 
ing some permutations of a 4-bit string, such as 0110, 
the partial product is three times the multiplicand. Since 
generating a multiplication of three is not as trivial as 
the shifting used in generating a multiplication of two, 
none of the semiconductor multipliers use more than 
three bits for encoding. 
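
The two-bits-at-a-time encoding can be expressed compactly: each 3-bit group Y(i+1), Y(i), Y(i-1) maps to the digit -2Y(i+1) + Y(i) + Y(i-1), which is one of 0, ±1, ±2. The Python sketch below is a behavioral model with assumed function names, not a description of any particular chip.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit 2's complement multiplier (n even) into n/2 digits."""
    y &= (1 << n) - 1
    padded = y << 1                      # append bit -1, which is always 0
    digits = []
    for k in range(n // 2):
        group = (padded >> (2 * k)) & 0b111
        y_hi, y_mid, y_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * y_hi + y_mid + y_lo)
    return digits


def booth_multiply(x, y, n):
    """Sum the n/2 partial products; group k carries a weight of 4**k."""
    return sum(d * x * 4 ** k
               for k, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(0b0111, 4))    # the 0111 example from the text
print(booth_multiply(5, -3, 4))
```

For the multiplier 0111 the digits are -1 and +2, that is, -X + 4(2X) = 7X, as in the worked example. Because the top group includes the sign bit with weight -2, negative multipliers need no separate correction.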

TABLE 4 

Comparison of Available Multiplier ICs 

Vendor/Device      Configuration  Pins  Speed   Power   Data Code          Amount of Parallelism           Algorithm
TRW/MPY-8          8 x 8          40    130 ns  1.2 W   2's comp           Full                            AND gates and carry-save adders
TRW/MPY-12         12 x 12        64    150 ns  3.5 W   2's comp           Full                            AND gates and carry-save adders
TRW/MPY-16         16 x 16        64    180 ns  5 W     2's comp           Full                            AND gates and carry-save adders
MMI/67558          8 x 8          40    100 ns  1 W     2's comp/unsigned  Full                            Modified Booth's; modified Wallace Tree
MMI/67516          16 x 16        24    800 ns  1 W     2's comp           Multiplicand/2 multiplier bits  Modified Booth's
MMI/67508          8 x 8          20    400 ns  0.75 W  2's comp           Multiplicand/2 multiplier bits  Modified Booth's
AMD/25S05          2 x 4          24    25 ns   0.6 W   2's comp           Full                            Booth's
AMD/25LS14         8 x 1          16    50 ns   0.5 W   2's comp           Multiplicand/1 multiplier bit   Booth's
AMD/25LS2516       8 x 8          40    400 ns  1 W     2's comp           Multiplicand/2 multiplier bits  Modified Booth's
TI/74S274 (ROM)*   4 x 4          20    50 ns   0.5 W   Unsigned           Full                            ROM lookup table*
Motorola/10183     2 x 4          24    20 ns   0.8 W   2's comp           Full                            Booth's

TABLE 5 

Performance Comparison Between 8 x 8 and 16 x 16 Multiplication 

                                       8 x 8 Multiplication        16 x 16 Multiplication
                                       No. of                      No. of
Vendor/Number   Configuration  Pins    Packages  Speed   Power     Packages  Speed   Power
MMI/67558       8 x 8          40      1         100 ns  1 W       14*       ...     9 W
TRW/MPY-8       8 x 8          40      1         130 ns  1.8 W     14*       ...     10 W
Motorola/10183  2 x 4          24      8         50 ns   6.4 W     32        100 ns  25.6 W
AMD/25S05       2 x 4          24      8         75 ns   5 W       32        150 ns  20 W
TI/74S274       4 x 4          20      12**      75 ns   5.4 W     45        120 ns  21 W
TRW/MPY-16      16 x 16        64      -         -       -         1         180 ns  5 W
MMI/67516       16 x 16        24      -         -       -         1         800 ns  1 W
AMD/25LS2516    8 x 8          40      1         400 ns  1 W       2         800 ns  2 W

*4 packages are 8 x 8 multipliers, 10 are adders (74S181/10181) 
**4 packages are 4 x 4 multipliers, 8 more are Wallace Tree bit-slices (74S275) 

The second type of multiplication algorithm deals 
with adding the partial products. All parallel algorithms 
use the carry-save adder; this adder is identical to a 
binary full adder. The only difference is in the 
interconnections of carries; instead of waiting for the carry 
to ripple, the carry is added at a later stage. 
Postponement of the addition of carries can be extended to all 
adder stages except the last. Carries from the last stage 
essentially form an n-bit operand to be added to the 
n-bit sum, and this operation can be done by 
carry-lookahead adders. Another side benefit of postponing 
addition of the carries to a later stage is the availability 
of a third input in each adder in the first stage; thus, the 
first three partial products can be added by the first 
stage, reducing the number of adder stages by one. This 
scheme is a minor modification of the Wallace Tree.[11] 
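
The carry-save idea can be modeled on whole words: each stage reduces three operands to a sum word and a carry word with no carry propagation between bit positions, and only the final stage performs a true addition. The Python sketch below is a behavioral model for unsigned operands with assumed names; it chains the stages linearly rather than arranging them as a tree.

```python
def carry_save_add(a, b, c):
    """One stage of full adders: three operands in, sum and carry words out."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # carries move up one bit
    return sum_word, carry_word


def csa_multiply(x, y, n):
    """Unsigned n x n multiply: AND-gate rows reduced by carry-save stages."""
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    while len(rows) > 2:
        # The first stage takes three partial products; each later stage
        # takes the saved sum and carry plus one more partial product.
        s, c = carry_save_add(rows[0], rows[1], rows[2])
        rows = [s, c] + rows[3:]
    return sum(rows)   # final carry-propagate addition (eg, carry-lookahead)


print(csa_multiply(200, 123, 8))
```

Each stage preserves the total (a + b + c equals the sum word plus the carry word), so only the last addition needs a fast carry-propagate adder.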

Table 4 summarizes available IC multipliers and the 
algorithms that they use. For example, the TRW 
multipliers [12] use AND gates to generate partial products, which 
are added by carry-save adders; however (unlike the 
Wallace Tree), they let the carries ripple through the 
last adder stage. 

The MMI 8 x 8 multiplier [13] (67558) generates partial 
products by using the modified Booth's algorithm. These 
partial products are then added in a Wallace Tree 
configuration. Texas Instruments extends the AND gate concept 
and provides a ROM (74S274) to generate a 4 x 4 segment 
of the partial products. They also provide a 7-bit Wallace 
Tree slice (74S275) for use in adding the partial 
products. This bit-slice can be used with other multipliers to 
provide an expanded multiplication. The AMD 25S05 [14] and 
the Motorola 10183 provide an on-chip solution to the 
expansion problem by implementing X times Y plus K 
instead of just X times Y. 

Some multipliers in Table 4 are semiparallel. The AMD 
25LS14 generates and accumulates one partial product 
at each clock pulse; this product is made up of one 
multiplier bit and the full width of the multiplicand. Thus, 
for 8 x 8 multiplication, eight clock pulses are required. 
The MMI 67516 is similar except that it shifts two bits 
at a time and the multiplicand is 16 bits wide. Thus, 
16 x 16 multiplication is performed in eight cycles. 
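
The clock counts for semiparallel multiplication follow directly from the number of multiplier bits consumed per clock. The Python sketch below is a simplified unsigned model with assumed names; it is not the Booth-encoded logic of the 25LS14 or 67516, and with two bits per clock it forms multiples up to 3X directly, which the modified Booth's encoding avoids.

```python
def serial_multiply(x, y, n, bits_per_clock=1):
    """Accumulate one partial product per clock; return (product, clocks)."""
    acc = 0
    clocks = 0
    mask = (1 << bits_per_clock) - 1
    for shift in range(0, n, bits_per_clock):
        digit = (y >> shift) & mask      # multiplier bits used this clock
        acc += (x * digit) << shift      # full-width multiplicand each time
        clocks += 1
    return acc, clocks


print(serial_multiply(200, 123, 8))                          # 1 bit per clock
print(serial_multiply(40000, 50000, 16, bits_per_clock=2))   # 2 bits per clock
```

Both cases take eight clocks: eight multiplier bits at one bit per clock, or 16 multiplier bits at two bits per clock.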

Table 5 compares performance of the various chips 
in performing 8 x 8 and 16 x 16 multiplication, and 
can be used to make the engineering tradeoff in system 
design of multiplication. If maximum speed is needed, 
say for 8 x 8 multiplication, then using eight packages 
of the ECL 10183 is the best choice; however, if the 
associated power dissipation of 6.4 W is excessive, the 
single chip MMI 67558 is best. While it multiplies 
somewhat slower than the ECL device, its power dissipation 
is only 1 W. 

Fig 1 Detailed AM2901 microprocessor block diagram. Main element of 2901 is ALU, which performs eight operations 
according to three instruction lines. Source and destination of operands and results are determined by remaining six 
instruction lines. Register file, in upper left corner, is 16 x 4 dual-port RAM, which is used as 16 registers or accumulators 
for ALU. External input and output data buses are also used as source and destination, respectively 

Bit-Slice Processor Elements 

The first arithmetic IC was a binary adder, which was 
later integrated into the combinatorial ALU (74181). The 
next evolution of integration (after the ALU) received 
many different names, such as bit-slice microprocessor, 
RALU (ALU with registers), and data slice. The first 
is the most popular description. Architecture of the 
bit-slice is typically built from the classic ALU, but with 
multiple accumulators (registers) and control over ALU 
sources and destination. Fig 1 shows the architecture of 
the AMD 2901,[15] which is probably the most commonly 
used bit-slice processor element. Other such elements 
are the Motorola 10800, MMI 6701, Intel 3000, TI 
74S481, and TI SBP 400. All have four bits per IC, 
except the Intel 3000, which is only two bits wide. 



To determine the speed of these devices, the register 
to register operation (RA + RB → RB)* is examined. 
The 2901A, a faster version of the 2901, performs such 
an operation in 90 ns. To compare this speed with that 
of the 74181 type ALU, it is necessary to add registers 
to the 74181, as shown in Fig 2. 

*RA + RB → RB means adding the contents of register A to the 
contents of register B and storing the sum into register B. All of 
this is accomplished in one cycle. 

Fig 2 MSI emulation of bit-slice processor element. (a) To compare simple ALU with 
bit-slice processor element, two registers are added to ALU. (b) Time to perform 
four bits (RA + RB → RA) is tabulated for various building blocks (ECL 10800, 
S/TTL 74S181, LS/TTL 2901A). Worst case delay is 60% more than the typical delay 

Expanding the bit slices to handle more than four 
bits is similar to expanding the 74181 ALUs. In fact, 
most bit-slice vendors recommend using the same 
carry-lookahead unit (74S182). To get a more realistic picture 
of the actual throughput of these devices, it is 
necessary to assume some overall system architecture. Fig 
3 shows the architecture for a 16-bit system. It is 
assumed to be microprogrammable with a pipelined 
microinstruction register; ie, maximum speed is achieved 
when no branch decisions are made. Fig 3 also contains 
a comparison of 16-bit throughput for various building 
blocks. In computing the addition time of the ALUs 
(74S181 and 10181), the configuration assumed is 
similar to that of Fig 2, with the addition of a multiplexer 
(to select one of several sources) and a microinstruction 
register. From Fig 3(b) it can be seen that the speed 
of the arithmetic processor elements is about half that 
of the older ALUs. This ratio applies to both TTL and 
ECL technologies. Slower speed of these elements is 
probably the main reason that many recent minicomputers 
still use the older ALUs. 

Monolithic Arithmetic Processors 

The MMI 67516 (mentioned in the discussion on 
multipliers) continues the trend toward greater 
integration per chip. In addition to multiplication, the device 
performs division using a nonrestoring algorithm; thus, 
for 16-bit division, 20 clock pulses are needed. At 
100 ns/clock, 2 µs are required to perform a division 
of a double-length dividend by a single-length divisor, 
resulting in a single-length quotient and remainder. The 
device contains four registers, two of which are used 
as accumulators, making it easy to perform (under 
microprogram control) a variety of operations, including 
sum of products, multiplication by constant, and division 
by constant. 
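
Nonrestoring division never undoes a subtraction that goes negative; it adds the divisor back on the next step instead, with one correction at the end. The Python sketch below is a textbook-style unsigned model with assumed names; it is not the 67516's microprogram and does not account for that device's extra clock pulses.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Divide a 2n-bit dividend by an n-bit divisor; return (quotient, remainder).

    Requires 0 < divisor < 2**n and dividend < divisor * 2**n, so that the
    quotient fits in n bits.
    """
    if not (0 < divisor < (1 << n)) or not (0 <= dividend < (divisor << n)):
        raise ValueError("operands out of range for an n-bit quotient")
    aligned = divisor << n          # divisor lined up with the dividend's top half
    rem = dividend                  # signed partial remainder
    quotient = 0
    for _ in range(n):
        # Subtract after a non-negative remainder, add after a negative one.
        rem = 2 * rem - aligned if rem >= 0 else 2 * rem + aligned
        quotient = (quotient << 1) | (1 if rem >= 0 else 0)
    if rem < 0:                     # single correction step at the end
        rem += aligned
    return quotient, rem >> n


print(nonrestoring_divide(100, 7, 4))
```

One quotient bit is produced per iteration, so an n-bit quotient takes n add/subtract steps plus the final correction.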

Development of this device was originally motivated 
by speech processing requirements, where 16-bit 
multiplications performed in 1 µs are used to implement digital 
filter equations. Later, it was realized that most of the 
data paths and registers needed for division already 
existed; this allowed division capability to be 
incorporated into the chip with a small increase in hardware 
and expansion of the microprogram. 

Future Trends 

To provide perspective on currently available arithmetic 
ICs, the limitations imposed by bipolar technology were 
analyzed. Now, to understand the future of arithmetic ICs, 
bipolar technologies must be examined in terms of 
potential density and speed. 

Two major density improvements are in sight. The 
first is an increase of the wafer diameter from 3 to 4 
in (7.6 to 10.2 cm). This increase, which has been 
implemented by some companies, nearly doubles the chip size 
(wafer area grows by a factor of about 1.8) 
while maintaining the same cost (up to a limit, the 
cost of processing a wafer is almost independent of its 
diameter). The second, more revolutionary, improvement 
is a new method of drawing the patterns needed for the 
fabrication of integrated circuits. Resolution currently 
obtained with optical lithography is lines approximately 
2 to 5 µm wide. Further resolution in optical 
techniques is limited by diffraction effects that occur between 
the mask and the wafer. The new method, electron-beam 
lithography, provides up to 20 times the resolution of 
optical lithography.[16] 

With these improvements, Texas Instruments expects 
to achieve a chip size of 140,000 sq mils.[17] With IIL 
technology, this chip size will contain about 10,000 
gates. At this high level of integration, many arithmetic 
ICs will probably give way to a single-chip high speed 
microprocessor that performs addition, multiplication, 
and division all at the same speed. Thus, one direction 
in the future will be further integration of multichip 
systems into a single chip while maintaining the same 
system speed. 

A second possible future trend will be to retain the 
same level of integration but to employ higher speed 
technology. A new bipolar process, elevated electrode 
integrated circuit (EEIC),[18] is reported to have 250-ps 
delay/gate at 2-mW power dissipation. With such a 
technology (still under research and development), it 
will be possible to construct an ALU that is functionally 
similar to the 74S181, but with an order of magnitude 
speed improvement, ie, 1 ns. A 16-bit computer made 
with this ALU and carry-lookahead units could perform 
register to register addition in 7 ns, if a sufficient variety 
of devices in the new family exists to build such a 
computer. 

The future directions outlined are merely extrapolations 
of the progress to date. As with the microprocessor 
revolution, it is possible that a completely new direction in 
computer arithmetic will emerge from the continuing 
advancements in IC technology. 